Add schema-aware intelligent column mapping pipeline #2

Open
arsh0198 wants to merge 2 commits into sbabyanusha:main from arsh0198:add-normalization-layer
Conversation

@arsh0198

Summary

This PR extends the supplemental data formatting workflow with a schema-aware column mapping pipeline that handles heterogeneous input datasets.

Added Features

  • Intelligent column normalization
  • Schema-aware fuzzy column mapping using RapidFuzz
  • Automatic schema detection logic
  • Validation for required schema fields
  • Modularized formatting pipeline architecture
  • CLI-based processing workflow
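
To illustrate how schema detection and required-field validation could fit together, here is a minimal sketch. It is not the code in `schemas.py` or `validator.py`: the `SCHEMAS` table, the schema names, and the functions `detect_schema` and `missing_required` are all hypothetical and chosen only for illustration. The sketch assumes detection works by picking the schema whose required fields are best covered by the already-mapped column names.

```python
# Hypothetical schema table: schema name -> set of required canonical fields.
# The real schemas.py may define different schemas and fields.
SCHEMAS = {
    "mutation": {"SAMPLE_ID", "HUGO_SYMBOL"},
    "patient": {"PATIENT_ID"},
}


def detect_schema(mapped_fields, schemas=SCHEMAS):
    """Return the schema whose required fields are best covered, or None.

    Ranking key: (fraction of required fields present, number present),
    so a fully covered larger schema wins over a fully covered smaller one.
    """
    present = set(mapped_fields)

    def score(name):
        required = schemas[name]
        hits = len(required & present)
        return (hits / len(required), hits)

    best = max(schemas, key=score)
    # If not a single required field matched, report no schema at all.
    return best if score(best)[1] > 0 else None


def missing_required(mapped_fields, schema_name, schemas=SCHEMAS):
    """List the required fields of a schema that are absent, sorted by name."""
    return sorted(schemas[schema_name] - set(mapped_fields))
```

A caller would run `detect_schema` on the mapped column names first, then pass the result to `missing_required` and reject or flag the file if the returned list is non-empty.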

New Components

  • mapper.py
  • schemas.py
  • validator.py

Example

Input columns:

  • Tumor Sample Barcode
  • Gene Name
  • Patient Identifier

Automatically mapped to:

  • SAMPLE_ID
  • HUGO_SYMBOL
  • PATIENT_ID
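
The mapping in this example can be sketched roughly as follows. This is an illustration under stated assumptions, not the contents of `mapper.py`: the `ALIASES` table, the 85-point cutoff, and the helper names `normalize`, `similarity`, `map_column`, and `map_columns` are hypothetical. Because a header such as "Gene Name" shares no characters with `HUGO_SYMBOL`, the sketch assumes each canonical field carries a list of known aliases and that fuzzy matching is done against those aliases. It uses `rapidfuzz.fuzz.ratio` when RapidFuzz is installed and falls back to the standard library's `difflib` otherwise.

```python
import re
from difflib import SequenceMatcher

try:
    # RapidFuzz is the library named in this PR; fuzz.ratio returns 0-100.
    from rapidfuzz import fuzz

    def similarity(a: str, b: str) -> float:
        return float(fuzz.ratio(a, b))
except ImportError:
    # Stdlib stand-in so the sketch still runs without RapidFuzz.
    def similarity(a: str, b: str) -> float:
        return 100.0 * SequenceMatcher(None, a, b).ratio()


# Hypothetical alias table: canonical field -> normalized names seen in the wild.
ALIASES = {
    "SAMPLE_ID": ["sample id", "tumor sample barcode"],
    "HUGO_SYMBOL": ["hugo symbol", "gene name", "gene symbol"],
    "PATIENT_ID": ["patient id", "patient identifier"],
}


def normalize(name: str) -> str:
    """Lowercase and collapse runs of non-alphanumerics into single spaces."""
    return re.sub(r"[^a-z0-9]+", " ", name.lower()).strip()


def map_column(name, aliases=ALIASES, cutoff=85.0):
    """Return the best-matching canonical field, or None if below the cutoff."""
    query = normalize(name)
    best_field, best_score = None, 0.0
    for field, names in aliases.items():
        for alias in names:
            score = similarity(query, alias)
            if score > best_score:
                best_field, best_score = field, score
    return best_field if best_score >= cutoff else None


def map_columns(columns):
    """Map every input header to a canonical field (or None if unmatched)."""
    return {col: map_column(col) for col in columns}
```

Calling `map_columns(["Tumor Sample Barcode", "Gene Name", "Patient Identifier"])` reproduces the mapping listed above. Headers that match no alias closely enough come back as `None`, which leaves room for a curator to review them rather than guessing.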

Goal

This contribution moves the formatter toward a reusable, schema-driven pipeline for curating supplemental data and reduces the manual preprocessing curators have to do.
